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Abstract In this paper, we present several descent methods that can be apphed 
to nonnegative matrix factorization and we analyze a recently devel- 
opped fast block coordinate method called Rank-one Residue Iteration 
(RRI). We also give a comparison of these different methods and show 
that the new block coordinate method has better properties in terms 
of approximation error and complexity. By interpreting this method as 
a rank-one approximation of the residue matrix, we prove that it con- 
verges and also extend it to the nonnegative tensor factorization and 
introduce some variants of the method by imposing some additional 
controllable constraints such as: sparsity, discreteness and smoothness. 
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1. Introduction 

Linear algebra has become a key tool in almost all modern techniques 
for data analysis. Most of these techniques make use of linear subspaces 
represented by eigenvectors of a particular matrix. In this paper, we 
consider a set of n data points ai, 02, . . . , a^, where each point is a real 
vector of size m, Oj G M™. We then approximate these data points by 
linear combinations of r basis vectors Ui G M"^: 



E 



Vij G 



Uj G 
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This can be rewritten in matrix form as ^ ~ UV , where and Ui are 
respectively the columns of A and U and the Vi/s are the elements of V. 
Optimal solutions of this approximation in terms of the Euclidean (or 
Frobenius) norm can be obtained by the Singular Value Decomposition 
(SVD) [13]. 

In many cases, data points are constrained to a subset of M™. For 
example, light intensities, concentrations of substances, absolute tem- 
peratures are, by their nature, nonnegative (or even positive) and lie in 
the nonnegative orthant W^. The input matrix A then becomes elemen- 
twise nonnegative and it is then natural to constrain the basis vectors Vi 
and the coefficients Vij to be nonnegative as well. In order to satisfy this 
constraint, we need to approximate the columns of A by the following 
additive model: 



where the Vij coefficients and Uj vectors are nonnegative, Vij G M+, Uj G 

Many algorithms have been proposed to find such a representation, 
which is referred to as a Nonnegative Matrix Factorization (NMF). The 
earliest algorithms were introduced by Paatero [25, 26]. But the topic 
became quite popular with the publication of the algorithm of Lee and 
Seung in 1999 [20] where multiplicative rules were introduced to solve 
the problem. This algorithm is very simple and elegant but it lacks 
a complete convergence analysis. Other methods and variants can be 
found in [23], [21], [17]. 

The quality of the approximation is often measured by a distance. 
Two popular choices are the Euclidean (Frobenius) norm and the gen- 
eralized Kullback-Leibler divergence. In this paper, we focus on the 
Euclidean distance and we investigate descent methods for this mea- 
sure. One characteristic of descent methods is their monotonic decrease 
until they reach a stationary point. This point maybe located in the in- 
terior of the nonnegative orthant or on its boundary. In the second case, 
the constraints become active and may prohibit any further decrease of 
the distance measure. This is a key issue to be analyzed for any descent 
method. 

In this paper, R™ denotes the set of nonnegative real vectors (elemen- 
twise) and [v]+ the projection of the vector v on R™. We use v > and 
A>0 to denote nonnegative vectors and matrices and v > and A> 
to denote positive vectors and matrices. Ao B and |4t respectively 
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the Hadamard (elementwise) product and quotient. A-i and Ai- are the 
z*'* column and i*^ tow of A. 

This paper is an extension of the internal report [15], where we pro- 
posed to decouple the problem based on rank one approximations to 
create a new algorithm called Rank-one Residue Iteration (RRI). Dur- 
ing the revision of this report, we were informed that essentially the 
same algorithm was independently proposed and published in [8] un- 
der the name Hierarchical Alternative Least Squares (HALS). But the 
present paper gives several additional results wherein the major contri- 
butions are the convergence proof of the method and its extensions to 
many pratical situations and constraints. The paper also compares a 
selection of some recent descent methods from the literature and aims 
at providing a survey of such methods for nonnegative matrix factor- 
izations. For that reason, we try to be self-contained and hence recall 
some well-known results. We also provide short proofs when useful for 
a better understanding of the rest of the paper. 

We first give a short introduction of low rank approximations, both 
unconstrained and constrained. In Section 3 we discuss error bounds of 
various approximations and in Section 4 we give a number of descent 
methods for Nonnegative Matrix Factorizations. In Section 5 we de- 
scribe the method based on successive rank one approximations. This 
method is then also extended to approximate higher order tensor and to 
take into account other constraints than nonnegativity. In Section 5.0 
we discuss various regularization methods and in Section 6, we present 
numerical experiments comparing the different methods. We end with 
some concluding remarks. 

2. Low- rank matrix approximation 

Low-rank approximation is a special case of matrix nearness problem 
[14]. When only a rank constraint is imposed, the optimal approximation 
with respect to the Frobenius norm can be obtained from the Singular 
Value Decomposition. 

We first investigate the problem without the nonnegativity constraint 
on the low-rank approximation. This is useful for understanding prop- 
erties of the approximation when the nonnegativity constraints are im- 
posed but inactive. We begin with the well-known Eckart- Young Theo- 
rem. 
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Theorem 2.1 (Eckart- Young) . Let A G K"*x"- (m>n) have the singu- 
lar value decomposition 



( ai ... Q \ 
0-2 ••• 







V 



/ 



where o"i > o"2 > . . . > 0"^ > are the singular values of A and where 
P G j^'^x'" (xnd Q £ M^xn ^y.g orthogonal matrices. Then for 1 < r < n, 
the matrix 



/ ai 

(72 







... 0\ 
... 



V 

is a global minimizer of the problem 

1, 



0/ 
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B^Rmxn rank(B)<r 2 



A-B\\l 



(1.1) 



ancJ iis error is 



\A-B\\l 



1 " 



i=r+l 



Moreover, if Ur > cr^+i then A^ is the unique global minimizer. 

The proof and other impHcations can be found for instance in [13]. 
The columns of P and Q are called singular vectors of A, in which 
vectors corresponding to the largest singular values are referred to as 
the dominant singular vectors. 

Let us now look at the following modified problem 



min -\\A- XY'^fp, 



(1.2) 



where the rank constraint is implicit in the product since the 

dimensions of X and Y guarantee that rank(XY^) < r. Conversely, 
every matrix of rank less than r can be trivially rewritten as a product 
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XY^, where X G M'"^'' and Y G M'^^^ Therefore Problems (1.1) and 
(1.2) are equivalent. But even when the product Ay. = XY^ is unique, 
the pairs {XR'^ , YR~^) with R invertiblc, yield the same product XY^. 
In order to avoid this, we can always choose X and Y such that 

X = PD^ and Y = QD^, (1.3) 

where P^P = Irxr, Q^Q = Irxr and D is r x r nonnegative diagonal 
matrix. Doing this is equivalent to computing a compact SVD decom- 
position of the product Ar = XY^ = PDQ^ . 

As usual for optimization problems, we calculate the gradient with 
respect to X and Y and set them equal to 0. 

Vx = XY'^Y -AY = Q Vy = YX'^X - A^ X = 0. (1.4) 

If we then premultiply A?^ with Vx and A with Vy, we obtain 

{A^A)Y = {A^X)Y^Y {AA^)X = {AY)X^X. (1.5) 

Replacing A^X = YX^X and AY = XY'^Y into (1.5) yields 

{A^A)Y = YX'^XY'^Y {AA^)X = XY'^YX'^X. (1.6) 

Replacing (1.3) into (1.6) yields 

{A^A)QD^ = QDP^PDQ^QD^ and {AA'^)PDh = PDQ'^QDP^PD^ . 

When D is invertible, this finally yields 

{A^A)Q = QD^ and {AA^)P = PD^. 

This shows that the columns of P and Q are singular vectors and 
Da's are nonzero singular values of A. Notice that if D is singular, one 
can throw away the corresponding columns of P and Q and reduce it to 
a smaller-rank approximation with the same properties. Without loss of 
generality, we therefore can focus on approximations of Problem (1.2) 
which are of exact rank r. We can summarize the above reasoning in 
the following theorem. 

Theorem 2.2. Let A e M'"''" (m > n and rank{A) = t). If Ar (I < 
r <t) is a rank r stationary point of Problem 1.2, then there exists two 
orthogonal matrices P G R"*^"* and Q G M"^" such that: 

A = PtQ^ and Ar = PtrQ^ 
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where 



/ <Tl 

6-2 

V 
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and the a[s are unsorted singular values of A. Moreover, the approxi- 
mation error is: 



This result shows that, if the singular values are all different, there are 
r\(n-ry. PO^sible stationary points Ar. When there are multiple singu- 
lar values, there will be infinitely many stationary points Ar since there 
are infinitely many singular subspaces. The next result will identify the 
minima among all stationary points. Other stationary points are sad- 
dle points whose every neighborhood contains both smaller and higher 
points. 



Theorem 2.3. The only minima of Problem 1.2 are given by Theorem 
2.1 and are global minima. All other stationary points are saddle points. 



Proof. Let us assume that A^ is a stationary point given by Theorem 
2.2 but not by Theorem 2.1. Then there always exists a permutation of 
the columns of P and Q, and of the diagonal elements of S and such 
that dr+i > ^r- We then construct two points in the e-neighborhood of 
Ar that yield an increase and a decrease, respectively, of the distance 
measure. They are obtained by taking: 



/ ai + e 



S.(€) 



. . . (Tr 



Arie) = PT.r{e)Q 



T 
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Clearly ^^(e) and A^{e) are of rank r. Evaluating the distance measure 
yields 



i=r+2 
t 

= e^[e^ - 2{ur+i - ar)] + ^ ^, 



for all e G (0, ■\j2{ar+\ — (5"^)) and 



i=r+\ 



t 

< ^a^ = \\A-A 

i=r+l 



2 

r\\F 



^..^„2 _.2 , Y,af>Y: -I 
i=r+l i=r+l 



\A-Ar{e)rF = e' + 



I 1 1 f 



for all e > 0. Hence, for an arbitrarily small positive e, we obtain 

\\A - A,{e)fp < \\A - ArWl < \\A - Me)\\F 
which shows that A^ is a saddle point of the distance measure. 



□ 



When we add a nonnegativity constraint in the next section, the re- 
sults of this section will help to identify stationary points at which all 
the nonnegativity constraints are inactive. 

3. Nonnegativity constraint 

In this section, we investigate the problem of Nonnegative Matrix 
Factorization. This problem differs Problem 1.2 in the previous section 
because of the additional nonnegativity constraints on the factors. We 
first discuss the effects of adding such a constraint. By doing so, the 
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problem is no longer easy because of the existence of local minima at 
the boundary of the nonnegativc orthant. Determining the lowest min- 
imum among these minima is far from trivial. On the other hand, a 
minimum that coincides with a minimum of the unconstrained problem 
(i.e. Problem 1.2) may be easily reached by standard descent methods, 
as we will see. 

Problem 1 (Nonnegative matrix factorization - NMF). Given a mx n 
nonnegative matrix A and an integer r < min(m, n), solve 



1 



mm ^WA-UV^fp. 



Where r is called the reduced rank. Prom now on, m and n will be 
used to denote the size of the target matrix A and r is the reduced rank 
of a factorization. 

We rewrite the nonnegative matrix factorization as a standard non- 
linear optimization problem: 

min -\\A-UV'^\\l. 
-u<o -v<o 2 

The associated Lagrangian function is 

L{U,V,fi,u) = ^\\A-UV^\\l- noU -uoV, 

where p, and v are two matrices of the same size of U and V, respec- 
tively, containing the Lagrange multipliers associated with the nonneg- 
ativity constraints Uij > and Vij > 0. Then the Karush-Kuhn- Tucker 
conditions for the nonnegative matrix factorization problem say that if 
([/, V) is a local minimum, then there exist /lij > and Vij > such 
that: 

U>0 , V>0, (1.7) 
VLu = , VLv = 0, (1.8) 
IJ,oU = , i^oV = 0. (1.9) 

Developing (1.8) we have: 

AV - UV^V - /i = 0, A^U - VU^U -u = 



or 

H = -{UV^V-AV), 



V = -{VU'^U - A^U). 
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Combining tliis with > 0, Uij > and (1.9) gives the following 
conditions: 

?7>0 , F>0, (1.10) 

VFu = UV^V - AV >{) , VFv = VU'^U-A'^U>0,{l.ll) 

U o {UV^V - AV) = , Vo{VU^U-A^U)=0, (1.12) 

where the corresponding Lagrange multipliers for U and V are also the 
gradient of F with respect to U and V. Since the Euclidean distance is 
not convex with respect to both variables U and V at the same time, 
these conditions are only necessary. This is implied because of the ex- 
istence of saddle points and maxima. We then call all the points that 
satisfy the above conditions, the stationary points. 

Definition 1 (NMF stationary point). We call (U, V) a stationary point 
of the NMF Problem if and only if U and V satisfy the KKT conditions 
(1.10), (1.11) and (1.12). 

Alternatively, a stationary point (?7, V) of the NMF problem can also 
be defined by using the following necessary condition (see for example 
[4]) on the convex sets W^^'^'' and that is 

vFv)'(y-y ))-°' vxGMr^^Gn"^ (i-i3) 

which can be shown to be equivalent to the KKT conditions (1.10), (1-11) 
and (1.12). Indeed, it is trivial that the KKT conditions imply (1.13). 
And by carefully choosing different values of X and Y from (1.13), one 
can easily prove that the KKT conditions hold. 

There are two values of reduced rank r for which we can trivially 
identify the global solution which are r = 1 and r = min(m, n). For 
r = 1, a pair of dominant singular vectors are a global minimizer. And 
for r = min(m, n), (?7 = ^, = I) is a global minimizer. Since most 
of existing methods for the nonnegative matrix factorization are descent 
algorithms, we should pay attention to all local minimizers. For the 
rank-one case, they can easily be characterized. 

Rank one case 

The rank-one NMF problem of a nonnegative matrix A can be rewrit- 
ten as ^ 

min -WA-uv^Wl^ (1-14) 

ueM^p vim-l 2 

and a complete analysis can be carried out. It is well known that any pair 
of nonnegative Perron vectors of AAl'- and Al'- A yields a global minimizer 



10 

of this problem, but we can also show that the only stationary points 
of (1.14) are given by such vectors. The following theorem excludes the 
case where u = and/or v = 0. 

Theorem 3.1. The pair (u, v) is a local minimizer of (1-14) if o,nd only 
if u and v are nonnegative eigenvectors of AA^ and A^A respectively of 
the eigenvalue a = HwHill^lli- 

Proof. The if part easily follows from Theorem 2.2. For the only if part 
we proceed as follows. Without loss of generality, we can permute the 
rows and columns of A such that the corresponding vectors u and v 
are partitioned as (u-i- 0)"^ and {v-^- 0)"^ respectively, where i;+ > 0. 
Partition the corresponding matrix A conformably as follows 



A 



An Ai2 
A21 A22 



then from (1.11) we have 

^ u+vl \ f v+ \ _ f An A12 \ f 

and 



M y I ^21 ^22 y V 



v+ul 0\fu+\ f Aj, A^.^f u 



) \ ) \ Aj^ A^^ ) \ 



> 



> 



implying that A2iV-^- < and AJ2U+ < 0. Since ^421 , ^412 > and 
Vjf- > 0, we can conclude that A12 = and A21 = 0. Then from 
(1.12) we have: 

ti+ o (||t;+||2n+ — A11V+) = and i;+ o (||u+||2i^_|_ — A^-^u+) = 0. 
Since u+, V-^- > 0, we have: 

= A11V+ and ||n+||2f+ = ^11^*+ 

or 

||ti+||2||-y+||iti+ = AiiAnU+ and 11^+1121^+112^+ = ^n^ii'U+. 

Setting a = ||'u+||2||t'+||2 and using the block diagonal structure of A 
yields the desired result. □ 

Theorem 3.1 guarantees that all stationary points of the rank-one 
case are nonnegative singular vectors of a submatrix of A. These results 
imply that a global minimizer of the rank-one NMF can be calculated 
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correctly based on the largest singular value and corresponding singular 
vectors of the matrix A. 

For ranks other than 1 and min(m,n), there are no longer trivial 
stationary points. In the next section, we try to derive some simple 
characteristics of the local minima of the nonnegative matrix factoriza- 
tion. 

The KKT conditions (1.12) help to characterize the stationary points 
of the NMF problem. Summing up all the elements of one of the condi- 
tions (1.12), we get: 



= ^(c/o(c/y^F-AF)). 



'13 
ij 

= {U,UV'^V-AV) 

= {UV'^,UV'^ - A) . (1.15) 
Prom that, we have some simple characteristics of the NMF solutions: 

Theorem 3.2. Let {U, V) be a stationary point of the NMF problem, 
then UV-^ € H the ball centered at ^ and with radius = 

Proof. From (1.15) it immediately follows that 

A ^^^^rp A I A A 



which implies 



2 '2 / \ 2 ' 2 



2 ' 2" " 



Theorem 3.3. Let {U, V) be a stationary of the NMF problem, then 

l\\A-UV^l = l{\\A\\l-\\UV^l). 
Proof. From (1.15), we have {UV'^,A) = {UV'^,UV^). Therefore, 

^{A-UV^,A-UV^) = ^{\\A\\l-2{UV^,A) + \\UV^\\l) 

= lm\l-\\uv^i). 

□ 

Theorem 3.3 also suggests that at a stationary point {U, V) of the 
NMF problem, we should have ||^|||' > HC/F-^Hl.. This norm inequality 
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can be also found in [7] for less general cases where we have VFfj = 
and VFy = at a stationary point. For this particular class of NMF 
stationary point, all the nonnegativity constraints on U and Vare in- 
active. And all such stationary points are also stationary points of the 
unconstrained problem, characterized by Theorem 2.2. 

We have seen in Theorem 2.2 that, for the unconstrained least-square 
problem the only stable stationary points are in fact global minima. 
Therefore, if the stationary points of the constrained problem are inside 
the nonnegative orthant (i.e. all constraints are inactive), we can then 
probably reach the global minimum of the NMF problem. This can be 
expected because the constraints may no longer prohibit the descent of 
the update. 

Let Ar be the optimal rank-r approximation of a nonnegative matrix 
A, which we obtain from the singular value decomposition, as indicated 
in Theorem 2.2. Then we can easily construct its nonnegative part [^r]+) 
which is obtained from Ar by just setting all its negative elements equal 
to zero. This is in fact the closest matrix in the cone of nonnegative 
matrices to the matrix Ar, in the Frobenius norm (in that sense, it is 
its projection on that cone). We now derive some bounds for the error 

\\A-[Ar] + \\F. 

Theorem 3.4. Let A,, be the best rank r approximation of a nonnegative 
matrix A, and let [Ar-]+ be its nonnegative part, then 

\\A-[Ar]+\\F< 11^- AIIf- 

Proof. This follows easily from the convexity of the cone of nonnegative 

matrices. Since both A and [Ar]+ are nonnegative and since [^r]+ is the 
closest matrix in that cone to Ar we immediately obtain the inequality 

\\A - AtWf > \\A - [Aj+Wl + WA- - > \\A - [ArUWl 

from which the result readily follows. □ 

The approximation [Ar]+ has the merit of requiring as much storage as 
a rank r approximation, even though its rank is larger than r whenever 
Ar / [^r]+- We will look at the quality of this approximation in Section 
6. If we now compare this bound with the nonnegative approximations 
then we obtain the following inequalities. Let U^,Vj' be an optimal non- 
negative rank r approximation of A and let UV'^ be any stationary point 
of the KKT conditions for a nonnegative rank r approximation, then we 
have : 

n 

\\A - [Ar]+\j, < \\A - Arfp = ^ (Ti < \\A - U^Vjfp < \\A - UV^fp. 

i=r+l 
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For more implications of the NMF problem, see [16]. 

4. Existing descent algorithms 

We focus on descent algorithms that guarantee a non increasing up- 
date at each iteration. Based on the search space, we have two categories: 
Full-space search and (Block) Coordinate search. 

Algorithms in the former category try to find updates for both U and 
V at the same time. This requires a search for a descent direction in the 
(m + ra)r-dimensional space. Note also that the NMF problem in this 
full space is not convex but the optimality conditions may be easier to 
achieve. 

Algorithms in the latter category, on the other hand, find updates 
for each (block) coordinate in order to guarantee the descent of the 
objective function. Usually, search subspaces are chosen to make the 
objective function convex so that efficient methods can be applied. Such 
a simplification might lead to the loss of some convergence properties. 
Most of the algorithms use the following column partitioning: 

\\\A-uvm = \±\\A,,-u(y.,ni (1-16) 

i=l 

which shows that one can minimize with respect to each of the rows 
of V independently. The problem thus decouples into smaller convex 
problems. This leads to the solution of quadratic problems of the form 

min -||a — ?7v|||. (1-17) 

v>o 2 

Updates for the rows of V are then alternated with updates for the rows 
of [/ in a similar manner by transposing A and UV^. 

Independent on the search space, most of algorithms use the Projected 
Gradient scheme for which three basic steps are carried out in each 
iteration: 

■ Calculating the gradient VF{x^), 

■ Choosing the step size a*^, 

■ Projecting the update on the nonnegative orthant 

^k+i ^ ^^k _ c,fevF(x'=)]+, 

where x'' is the variable in the selected search space. The last two steps 
can be merged in one iterative process and must guarantee a sufficient 
decrease of the objective function as well as the nonnegativity of the new 
point. 
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Multiplicative rules (Mult) 

Multiplicative rules were introduced in [20]. The algorithm applies 
a block coordinate type search and uses the above column partition to 
formulate the updates. A special feature of this method is that the step 
size is calculated for each clement of the vector. For the elementary 
problem (1.17) it is given by 



V 



a 



where [a'^]i = jifrffy]-- Applying this to all rows of V and U gives the 
updating rule of Algorithm 1 to compute 



(U*,V*)= argmin \\A-UV^\\l. 

U>0 V>0 



Algorithm 1 (Mult) 



1: Initialize U^, and k 
2: repeat 



yk+l = yfc o [-^"' ' 



[y'=(c/'=+i)T([/fc+i)] 

5: k = k + l 

6: until Stopping condition 



These updates guarantee aTitomatically the nonncgativity of the fac- 
tors but may fail to give a sufficient decrease of the objective function. 
It may also get stuck in a non-stationary point and hence suffer from a 
poor convergence. Variants can be found in [21, 24]. 

Line search using Armijo criterion (Line) 

In order to ensure a sufficient descent, the following projected gradient 
scheme with Armijo criterion [23, 22] can be applied to minimize 

X* = argmin F(x). 

X 

Algorithm 2 needs two parameters a and (3 that may affect its con- 
vergence. It requires only the gradient information, and is applied in 
[23] for two different strategies : for the whole space {U, V) (Algorithm 
FLine) and for U and V separately in an alternating fashion (Algorithm 
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Algorithm 2 (Line) 

1: Initialize x^, a, P, ao = 1 and k = 1 
2: repeat 

3: ttfc = ak-1 

4: y = [x^ - afcVF(x^)] + 

5: if F{y) - F{x'') > a{VF{x''),y - x'' ) then 

6: repeat 

7: ak = ak- /3 

8: y = [x'' - akVF{x'')]^ 

9: until F{y) - F{x^) < a {VF{x^),y - x^) 

10: else 

11: repeat 

12: lasty = y 

13: ak = ak/P 

14: y=[x''-akVF{x^)] + 

15: until F{y) - > a {\/F{x^),y - x^) 

16: y = lasty 

17: end if 

18: X^^^ = y 
19: = + 1 

20: until Stopping condition 
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CLine). With a good choice of parameters {a = 0.01 and /3 = 0.1) and 
a good strategy of alternating between variables, it was reported in [23] 
to be the faster than the multiplicative rules. 

Projected gradient with first-order approximation 
(FO) 

In order to find the solution to 

X* = argminF(x) 

X 

we can also approximate at each iteration the function F{X) using: 

F{x) = F{x'') + (V:aF{x''),x - x^^ + ^Wx'' - xg, 

where L is a Lipshitz constant satisfying F(x) < F{x), Vx. Because of 
this inequality, the solution of the following problem 

= argminF(x) 
x>0 

also is a point of descent for the function F{x) since 

F{xk+i) < F{xk+i) < F{xk) = F{xk). 

Since the constant L is not known a priori, an inner loop is needed. 
Algorithm 3 presents an iterative way to carry out this scheme. As 
in the previous algorithm this also requires only the gradient informa- 
tion and can therefore can be applied to two different strategies: to the 
whole space {U, V) (Algorithm FFO) and to U and V separately in an 
alternating fashion (Algorithm CFO). 

A main difference with the previous algorithm is its stopping criterion 
for the inner loop. This algorithm requires also a parameter /? for which 
the practical choice is 2. 

Alternative least squares methods 

The first algorithm proposed for solving the nonnegative matrix fac- 
torization was the alternative least squares method [25]. It is known 
that, fixing either U or V, the problem becomes a least squares problem 
with nonnegativity constraint. 

Since the least squares problems in Algorithm 4 can be perfectly de- 
coupled into smaller problems corresponding to the columns or rows of 
A, we can directly apply methods for the Nonnegative Least Square 
problem to each of the small problem. Methods that can be applied are 
[19], [6], etc. 
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Algorithm 3 (FO) 

1: Initialize x", Lq and A; = 

2: repeat 

3: y= [x^ - ^VF{x^)] + 

4: while F{y) - F{x'') > {V F{x^),y - x^) + ^\\y - x^\\l do 

5: Lfc = Lk/P 

6: Y = [X^ - ^VF{X^)] + 

7: end while 

8: X^^^ = y 

10: A; = A; + 1 

11: until Stopping condition 



Algorithm 4 Alternative Least Square (ALS) 
1: Initialize U and V 
2: repeat 

3: Solve: minv>o ^11^ - f^^^lll' 
4: Solve: min[/>oip^- 1/?7^ III 
5: until Stopping condition 



Implementation 

The most time-consuming job is the test for the sufficient decrease, 
which is also the stopping condition for the inner loop. As mentioned 
at the beginning of the section, the above methods can be carried out 
using two different strategies: full space search or coordinate search. In 
some cases, it is required to evaluate repeatedly the function F{U,V). 
We mention here how to do this efficiently with the coordinate search. 

Full space search: The exact evaluation of F{x) = F{U,V) = 
\\A — UV'^W'^ need 0{mnr) operations. When there is a correction 
y = [U + AU, V + AV), we have to calculate F{y) which also requires 
0{mnr) operations. Hence, it requires 0{tmnr) operations to determine 
a stepsize in t iterations of the inner loop. 

Coordinate search: when V is fixed, the Euclidean distance is a 
quadratic function on U : 

F{U) = \\A-UV^\\l = {A,A)-2{UV^,A) + {UV^,UV^) 

= \\A\\%-2{U,AV) + {U,U{V^V)). 

The most expensive step is the computation of AV, which requires 
0{mnr) operations. But when V is fixed, AV can be calculated once 
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at the beginning of the inner loop. The remaining computations are 
{U,AV) and (^U,U{V'^V)'), which requires 0{nr) and 0{nr^ + nr) oper- 
ations. Therefore, it requires 0{tnr'^) operations to determine a stepsize 
in t iterations of the inner loop which is much less than Oitmnr) oper- 
ations. This is due to the assumption r <^ n. Similarly, when U fixed, 
0{tmr'^) operations are needed to determine a stepsize. 

If we consider an iteration is a sweep, i.e. once all the variables are 
updated, the following table summarizes the complexity of each sweep 
of the described algorithms: 



Algorithm 


Complexity per iteration 


Mult 


Oijnnr) 


FLine 
CLine 


0(tmnr) 
0{tinr'^ + t2mr^) 


FFO 
CFO 


Oitmnr) 
Oitmr"^ + t2mr'^) 


ALS 
lALS 


0{2^rnnr)* 
0{mnr) 



where t, ti and t2 arc the number of iterations of inner loops, which 
can not be bounded in general. For algorithm ALS, the complexity is 
reported for the case where the active set method [19] is used. Although 
0{2'^mnr) is a very high theorical upper bound that count all the possible 
subsets of r variables of each subproblem, in practice, the active set 
method needs much less iterations to converge. One might as well use 
more efficient convex optimization tools to solve the subproblems instead 
of the active set method. 



Scaling and Stopping criterion 

For descent methods, several stopping conditions are used in the lit- 
erature. We now discuss some problems when implementing these con- 
ditions for NMF. 

The very first condition is the decrease of the objective function. The 
algorithm should stop when it fails to make the objective function de- 
crease with a certain amount : 

FiU'^\v'^^')-F{U\v')<e or < e. 

This is not a good choice for all cases since the algorithm may stop at a 
point very far from a stationary point. Time and iteration bounds can 
also be imposed for very slowly converging algorithms. But here again 
this may not be good for the optimality conditions. A better choice is 
probably the norm of the projected gradient as suggested in [23]. For 
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the NMF problem it is defined as follows : 

[V. 



Pi _ J [Vxjii if ^ij > 

^J*^ 1 min(0, [Vx]i,) iiXij = 



V 



yk 



< e 



(1.18) 



where X stands for U or V. The proposed condition then becomes 

F V 

We should also take into account the scaling invariance between U and 
V. Putting U = 7^7 and V = -^V does not change the approximation 

7 

UV^ but the above projected gradient norm is afi:ected: 



4 



\n\\F + llVf 11^ = ;^||V^||^ + 7l|Vt;||^1.19) 



F 



Two approximate factorizations = tJV^ resulting in the same ap- 

proximation should be considered equivalent in terms of precision. One 
could choose 7^ := ||V^||i?/||Vy which minimizes (1.19) and forces 
llV^lli? = ||V^||_p, but this may not be a good choice when only one of 
the gradients ||Vp||i? and ||V^||i? is nearly zero. 

■ " 

and any stopping criterion that uses gradient information is affected by 
this scaling. To limit that effect, we suggest the following scaling after 
each iteration: 



In fact, the gradient 



is scale dependent in the NMF problem 



Uk ^ UkDk Vk 
where is a positive diagonal matrix: 



VkD 



-1 



k\ii 




This ensures that 
ence between 



\u._, 



\V:i\\'jp and hopefully reduces also the differ- 



2 _ 
F - 

yfjWl and II v^ii^. 
The same scaling should be applied to the initial point as well {Ui, Vi] 



V^^llp. Moreover, it may help to avoid 



when using (1.18) as the stopping condition. 



5. 



Rank-one Residue Iteration 



In the previous section, we have seen that it is very appealing to 
decouple the problem into convex subproblems. But this may "converge" 
to solutions that are far from the global minimizers of the problem. 
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In this section, we analyze a different decoupling of the problem 
based on rank one approximations. This also allows us to formulate 
a very simple basic subproblem. This scheme has a major advantage 
over other methods : the subproblems can be optimally solved in closed 
form. Therefore it can be proved to have a strong convergence results 
through its damped version and it can be extended to more general types 
of factorizations such as for nonnegative tensors and to some practical 
constraints such as sparsity and smoothness. Moreover, the experiments 
in Section 6 suggest that this method outperforms the other ones in 
most cases. During the completion of the revised version of this report, 
we were informed that an independent report [8] had also proposed this 
decoupling without any convergence investivation and extentions. 

New partition of variables 

Let the Uj's and Vi's be respectively the columns of U and V. Then 
the NMF problem can be rewritten as follows : 

Problem 2 (Nonnegative Matrix Factorization). Given a m x n non- 
negative matrix A, solve 



1 

min -\\A—}uivf\\p. 



Ui>0 Vt>0 2 

Let us fix all the variables, except for a single vector vt and consider 
the following least squares problem: 

mmlwRt-utv'^Wl, (L20) 

where Rt = A — X^i^t u-ivf . We have: 

\\Rt-utv'^fF = trace [{Rt-utv^f{Rt-utv^)] (L21) 
= \\Rt\\l-2v'^R[ut + \\ut\\l\\v\\l. (L22) 

Prom this formulation, one now derives the following lemma. 

Lemma 5.1. // [R[ut]+ ^ 0, then := "fl"*" is the unique global 

minimizer of (1.20) and the function value equals ||i?t||p — -^^^^f-^Tir-^ • 

Proof. Let us permute the elements of the vectors x := R^ut and v such 
that 

Px = f , Pi; = r , with xi > 0, X2<0 
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and P is the permutation matrix. Then 

\\Rt - utv'^\\% = \\Rt\\F - 2vixi - 2V2X2 + \\ut\\l{vlvi + V2V2). 

Since X2 < and V2 > 0, it is obvious that ||-Rt — utv'^W'^ can only be 
minimal if V2 = 0. Our assumption implies that xi is nonempty and 
xi > 0. Moreover [Rfut]+ 7^ and ut > imply ||ttt||2 > 0, one can 
then find the optimal vi by minimizing the remaining quadratic function 

WRtWp - ^vjxi + -ui 

which yields the solution vi = ,, . Putting the two components to- 
gether yields the result 

[RjutU . IIP r||2 11PI12 ll[^^^t]+||i 
■y* = — n — TTn— and \\Rt - utv^ \\p = WRtUp n — fT9 — • 

□ 



Algorithm 5 (RRI) 



9 
10 
11 
12 

13 

14 
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Initialize UiS, Vi's, for i = 1 to r 
repeat 

for t = 1 to r do 

Rt = A-J2i^tUiv'[ 

if [Rfut]+ / then 

vt ^ 11' 

II"* 112 

else 

vt = 
end if 

if [Rtvt]+ / then 

[Rtvt] + 

else 

ut = 
end if 
end for 
until Stopping condition 



Remark 1: The above lemma has of course a dual form, where one 
fixes vt but solves for the optimal u to minimize ||-Rt — uvfW^. This 
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would yield the updating rules 

[RfutU [Rtvt]+ ,^ 

Il^«ll2 Il^^tll2 

which can be used to recursively update approximations Yll=i ''^i'^^I by 
modifying each rank-one matrix utvf in a cyclic manner. This problem 
is different from the NMF, since the error matrices Rt = A — ^^.^^ '^i'^T 
are no longer nonnegative. We will therefore call this method the Rank- 
one Residue Reration (RRI), i.e. Algorithm 5. The same algorithm 
was independently reported as Hierarchical Alternating Least Squares 
(HALS) [8]. 

Remark 2: In case where [Rfut]+ = 0, we have a trivial solution 
for = that is not covered by Lemma 5.1. In addition, if ut = 0, 
this solution is no longer unique. In fact, v can be arbitrarily taken 
to construct a rank-deficient approximation. The effect of this on the 
convergence of the algorithm will be discussed further in the next section. 

Remark 3: Notice that the optimality of Lemma 5.1 implies that 
\\A — UV'^W can not increase. And since ^ > fixed, UV^ > must be 
bounded. Therefore, its component Uivj (i=l. . . r) must be bounded as 
well. One can moreover scale the vector pairs {ui,Vi) at each stage as 
explained in Section 4.0 without affecting the local optimality of Lemma 
5.1. It then follows that the rank one products Uivf and their scaled 
vectors remain bounded. 



Convergence 

In the previous section, we have established the partial updates for 
each of the variable Ui or Vi. And for a NMF problem where the re- 
duced rank is r, we have in total 2r vector variables (the Uj's and Vj's). 
The described algorithm can be also considered as a projected gradient 
method since the update (1.23) can be rewritten as: 



Ut 



[Rtvt]- 



1^* 



[{A - ^. Uivf + utvf)vt]- 



\vt 



T.iUivf)vt + utvfvt]+ 



\vt 



Ut 



\vt 



|2 



Similarly, the update for Vi can be rewritten as 

1 



vt 



Vt 



Ut 



1 2 



Therefore, the new method follows the projected gradient scheme de- 
scribed in the previous section. But it produces the optimal solution 
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in closed form. For each update of a column vt (or n^), the proposed 
algorithm requires just a matrix- vector multiplication R^ut (or RtVt), 
wherein the residue matrix Rt = A — ^j.^^ UivJ does not have to be cal- 
culated explicitly. Indeed, by calculating Rfut (or RtVt) from A^ut (or 
Avt) and Yli^t'^^ii'^J ^i) Yli^t'^ii'^J '^t)) ■, the complexity is reduced 
from 0{mnr + mn) to only O {mn + {m + n){r — 1)) which is majored 
by 0{mn). This implies that the complexity of each sweep through 
the 2r variables u'^s and v'f.s requires only 0{mnr) operations, which is 
equivalent to a sweep of the multiplicative rules and to an inner loop of 
any gradient methods. This is very low since the evaluation of the whole 
gradient requires already the same complexity. 

Because at each step of the 2r basic steps of Algorithm 5, we compute 
an optimal rank-one nonnegative correction to the corresponding error 
matrix Rt the Frobenius norm of the error can not increase. This is a 
reassuring property but it does not imply convergence of the algorithm. 

Each vector ut or vt lies in a convex set Ut C or Vt C M". 
Moreover, because of the possibility to include scaling we can set an 
upper bound for \\U\\ and ||V^||, in such a way that all the Vt and Yt 
sets can be considered as closed convex. Then, we can use the following 
Theorem 5.1, to prove a stronger convergence result for Algorithm 5. 

Theorem 5.1. Every limit point generated by Algorithm 5 is a station- 
ary point. 

Proof. We notice that, ii ut = and vt = at some stages of Algo- 
rithm 5, they will remain zero and no longer take part in all subsequent 
iterations. We can divide the execution of Algorithm 5 into two phases. 

During the first phase, some of the pairs {itt,vt) become zero. Be- 
cause there are only a finite number (2r) of such vectors, the number of 
iterations in this phase is also finite. At the end of this phase, we can 
rearrange and partition the matrices U and V such that 

U = {U+ 0) and V = {V+ 0), 

where and V+ do not have any zero column. We temporarily remove 
zero columns out of the approximation. 

During the second phase, no column of U+ and V+ becomes zero, 
which guarantees the updates for the columns of and V+ are unique 
and optimal. Moreover, ^11^4 — Yll=i'^i''^I\\'F continuously differen- 
tiablc over the set Ui x . . . x Ur X Vi X . . . X V^, and the Uj's and 
Vj's are closed convex. A direct application of Proposition 2.7.1 in [4] 
proves that every stationary point (U^,V^) is a stationary point. It is 
then easy to prove that if there are zero columns removed at the end 
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of the first phase, adding them back yields another stationary point: 
U* = (C/+ 0) and V* = {Vj^ 0) of the required dimension. However, 
in this case, the rank of the approximation will then be lower than the 
requested dimension r. □ 

In Algorithm 5, variables are updated in this order: ui, vi, U2, V2, .... 
We can alternate the variables in a different order as well, for example 
ui, U2, Ur vi, V2, ■ ■ ■, Vr, .... Whenever this is carried out in a 
cyclic fashion, the Theorem 5.1 still holds and this does not increase the 
complexity of each iteration of the algorithm. 

As pointed above, stationary points given by Algorithm 5 may contain 
useless zero components. To improve this, one could replace UtvJ{= 0) 
by any nonnegative rank-one approximation that reduces the norm of 
the error matrix. For example, the substitution 

ut = ei* vt = [R[ut]+, (1.24) 

where i* = argmaxj ||[i?^ei]+|||, reduces the error norm by ej]+||| > 
unless Rt < 0. These substitutions can be done as soon as ut and vt 
start to be zero. If we do these substitutions in only a finite number of 
times before the algorithm starts to converge, Theorem 5.1 still holds. 
In practice, only a few such substitutions in total are usually needed 
by the algorithm to converge to a stationary point without any zero 
component. Note that the matrix rank of the approximation might not 
be r, even when all uts and uj's (t = 1 . . . r) are nonzero. 

A possibly better way to fix the problem due to zero components is 
to use the following damped RRI algorithm in which wc introduce new 
2r dummy variables Wi G Uj and Zi G Vj, where i = \...r. The new 
problem to solve is: 

Problem 3 (Damped Nonnegative Matrix Factorization). 

-'- II ,1 Tii2 ^ II II 2 II ii2 

«,>T.^>0 2 " 2- ^^^^ \\F + 1^1^\\U^-W,\\2 + -^\\V,-Z,\\2, 

m>o zi>o *=i « « 

where the damping factor tp is a positive constant. 

Again, the coordinate descent scheme is applied with the cyclic update 
order: ui, wi, vi, zi, U2, W2, V2, Z2, ■ ■ ■ to result in the following optimal 
updates for ut, vt, Wf and zt'. 



[Rtvt]+ + i^wt [Rfuth + i^zt 

'^t = — n — , / — > = Ut, Vt = — r. — , , — and zt = vt (1.25) 
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where t = 1 . . . r. The updates wt = ut and zt = vt can be integrated 
in the updates of ut and vt to yield Algorithm 6. We have the following 
results: 

Theorem 5.2. Every limit point generated by Algorithm 6 is a station- 
ary point of NMF problem 2. 



Algorithm 6 (Damped RRI) 



Initialize UiS, Vi's, for z = 1 to r 
repeat 

for t = 1 to r do 

Rt = A- Y^i^t 

[Rfut+-4>vt\ + 
[Rtvt+i'ut] + 

end for 
until Stopping condition 



Proof. Clearly the cost function in Problem 3 is continuously differen- 
tiable over the set Ui x . . . xU^ xUi x . . .xU^ xVi x . . . x x Vi x . . . xV^, 
and the Uj's and V^'s are closed convex. The uniqueness of the global 
minimum of the elementary problems and a direct application of Proposi- 
tion 2.7.1 in [4] prove that every limit point of Algorithm 6 is a stationary 
point of Problem 3. 

Moreover, at a stationary point of Problem 3, we have Ut = Wf and 
Vt = Zt, t = l...r. The cost function in Problem 3 becomes the cost 
function of the NMF problem 2. This implies that every stationary point 
of Problem 3 yields a stationary point of the standard NMF problem 
2. □ 

This damped version not only helps to eliminate the problem of zero 
components in the convergence analysis but may also help to avoid zero 
columns in the approximation when ^Jj is carefully chosen. But it is not 
an easy task. Small values of tp provide an automatic treatment of zeros 
while not changing much the updates of RRI. Larger values of ijj might 
help to prevent the vectors Ut and vt {t = 1 . . .r) from becoming zero too 
soon. But too large values of limit the updates to only small changes, 
which will slow down the convergence. 

In general, the rank of the approximation can still be lower than the 
requested dimension. Patches may still be needed when a zero compo- 
nent appears. Therefore, in our experiments, using the undamped RRI 
algorithm 5 with the substitution (1.24) is still the best choice. 
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Variants of the RRI method 

We now extend the Rank-one Residue Iteration by using a factor- 
ization of the type XDY^ where D is diagonal and nonnegative and 
the columns of the nonnegative matrices X and Y are normalized. The 
NMF formulation then becomes 

1 

where Xj's and Yj's are sets of normed vectors. 

The variants that we present here depend on the choice of Xj's and 
Yj's. A generalized Rank-one Residue Iteration method for low-rank 
approximation is given in Algorithm 7. This algorithm needs to solve a 
sequence of elementary problems of the type: 

maxy"^s (1-26) 

where y G M" and S C M" is a set of normed vectors. Wc first introduce 
a permutation vector ly = {i\ ■ ■ ■ in) which reorders the elements of 
y in non-increasing order : > k = 1 . . . {n — 1). The function 

p{y) returns the number of positive entries of y. 

Algorithm 7 GRRI 
1: Initialize Xj's, y^'s and dj's, for i = 1 to r 
2: repeat 

3: for i = 1 to r do 

4: Ri = A-J2j^idjXjyJ 

5: yi ^ argmax^gYi [xj Ris) 
6: Xi ^ argmax^gXi {yfRjc) 
7: di = xjRiyi 
8: end for 

9: until Stopping condition 



Let us first point out that for the set of normed nonnegative vectors 
the solution of problem (1.26) is given by s* = jj^j^- It then follows that 
Algorithm 7 is essentially the same as Algorithm 5 since the solutions Vi 
and Ui of each step of Algorithm 7, given by (1.23), correspond exactly 
to those of problem (1.26) via the relations y^ = Ui/\\ui\\2, yi = Vi/\\vi\\2 
and di = \\ui\\2\\vi\\2. 

Below we list the sets for which the solution s* of (1.26) can be easily 
computed. 
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Set of normed vectors: s = jAr- This is useful when one wants 
to create factorizations where only one of the factor U or F is 
nonnegative and the other is real matrix. 

Set of normed nonnegative vectors: s = 

Set of normed bounded nonnegative vectors {s}: where < < 
Si < Pi- The optimal solution of (1.26) is given by: 

fi ■ f y+ 

s = max /, mm p, 



Set of normed binary vectors {s}: where * = p[y and b G {0, 1}". 
The optimal solution of (1.26) is given by: 

1 if t < k* V 
, ~ where k* = argmax ^*~L * . 

otherwise k vk 

Set of normed sparse nonnegative vectors: all normed nonnegative 
vectors having at most K nonzero entries. The optimal solution 
for (1.26) is given by norming the following vector p* 

yi^ at <mm{p{y),K) 
otherwise 



bin = 



■ Set of normed fixed- spar sity nonnegative vectors: all nonnegative 
vectors s a fixed sparsity, where 

spar sity (s) = — JLJiiZILJk _ 

— 1 

The optimal solution for (1.26) is given by using the projection 
scheme in [17]. 

One can also imagine other variants, for instance by combining the 
above ones. Depending on how data need to be approximated, one can 
create new algorithms provided it is relatively simple to solve problem 
(1.26). There have been some particular ideas in the litcratTircs such 
as NMF with sparseness constraint [17], Semidiscrete Matrix Decom- 
position [18] and Semi-Nonnegative Matrix Factorization [11] for which 
variants of the above scheme can offer an alternative choice of algorithm. 

Remark: Only the first three sets are the normed version of a closed 
convex set, as required for the convergence by Theorem 5.1. Therefore 
the algorithms might not converge to a stationary point with the other 
sets. However, the algorithm always guarantees a non-increasing up- 
date even in those cases and can therefore be expected to return a good 
approximation. 
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Nonnegative Tensor Factorization 

If we refer to the problem of finding the nearest nonnegative vector 
to a given vector a as the nonnegative approximation in one dimension, 
the NMF is its generahzation in two dimensions and naturally, it can 
be extended to even higher-order tensor approximation problems. Al- 
gorithms described in the previous section use the closed form solution 
of the one dimensional problem to solve the two-dimensional problem. 
We now generalize this to higher orders. Since in one dimension such an 
approximation is easy to construct, we continue to use this approach to 
build the solutions for higher order problems. 

For a low-rank tensor, there are two popular kinds of factored tensors, 
namely those of Tucker and Kruskal [2]. We only give an algorithm for 
finding approximations of Kruskal type. It is easy to extend this to 
tensors of Tucker type, but this is omitted here. 

Given a d dimensional tensor T, we will derive an algorithm for ap- 
proximating a nonnegative tensor by a rank-r nonnegative Kruskal ten- 
sor S G i^^i^^zx.-.xnd represented as a sum of r rank-one tensors: 

r 

S = ^ UiUii ~kU2i-k ...i^Udi 

i=l 

where ai G M+ is a scaling factor, uu G M"* is a normed vector (i.e. 
II '"till 2 = 1) and a-kh stands for the outer product between two vectors 
or tensors a and h. 

The following update rules are the generalization of the matrix case 
to the higher order tensor: 

y = {.■■{{.. ■{RkUik)---Uit-i)k)u{t+i)k)---)udk (1.27) 

Gk = ||b]+||2, «tfe = ^, (1.28) 

where Rk = T — OiUu -k U2i ★ . . . * Udi is the residue tensor calcu- 

lated without the /c*'* component of S and R^Uij is the ordinary ten- 
sor/vector product in the corresponding dimension. 

We can then produce an algorithm which updates in a cyclic fashion 
all vectors Uji. This is in fact a direct extension to Algorithm 5, one can 
carry out the same discussion about the convergence here to guarantee 
that each limit point of this algorithm is a stationary point for the non- 
negative tensor factorization problem and to improve the approximation 
quality. 

Again, as we have seen in the previous section, we can extend the 
procedure to take into account different constraints on the vectors Uij 
such as discreteness, sparseness, etc. 
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The approach proposed here is again different from that in [9] where 
a similar cascade procedure for multilayer nonnegative matrix factor- 
ization is used to compute a 3D tensor approximation. Clearly, the 
approximation error will be higher than our proposed method, since the 
cost function is minimized by taking into account all the dimensions. 

Regularizations 

The regularizations are common methods to cope with the ill-posedness 
of inverse problems. Having known some additional information about 
the solution, one may want to imposed a priori some constraints to al- 
gorithms, such as: smoothness, sparsity, discreteness, etc. To add such 
regularizations in to the RRI algorithms, it is possible to modify the 
NMF cost function by adding some regularizing terms. We will list here 
the update for UiS and ViS when some simple regularizations are added 
to the original cost function. The proof of these updates are straight- 
forward and hence omitted. 

■ One-Norm ||.||i regularization: the one- norm of the vector variable 
can be added as a heuristic for finding a sparse solution. This is 

an alternative to the fixcd-sparsity variant presented above. The 
regularized cost function with respect to the variable vt will be 

^\\Rt-utv^\\l + (5\\v\\i, /3>0 
where the optimal update is given by 

* _ [I^Ut - /31nxl] + 
"^t — li iTo ■ 

ll«tlli 

The constant /3 > can be varied to control the trade-off between 
the approximation error \\\Rt — utv^W^p and \\v\\i. From this up- 
date, one can see that this works by zeroing out elements of li^ut 
which are smaller than /3, hence reducing the number of nonzero 
elements of . 

■ Smoothness regularization \\v — Bvt\\'p: where Vt is the current 
value of Vt and the matrix B helps to calculate the average of the 
neighboring elements at each element of v. When v is a ID smooth 
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function, B can be the following n x n matrix: 



B = 



( 

1 





V 



1 

2 





\ 




\ 



(1.29) 



/ 



This matrix can be defined in a different way to take the true 

topology of V into account, for instance v = vec{F) where F is a 
matrix. The regularized cost function with respect to the variable 
vt will be 



Rt - utv 



If + ol 



Bn 



t\\F^ 



5 > 



where the optimal update is given by 

[Rfut + dBvt] 



Ut 



The constant 5 >0 can be varied to control the trade-off between 
the approximation error ^\\Rt — utv'^W'^ and the smoothness of vt 
at the fixed point. From the update, one can see that this works 
by searching for the optimal update with some preference for 
the neighborhood of Bvi, i.e., a smoothed vector of the current 
value Vt- 

The two above regularizations can be added independently to each of 
the columns of U and/or V. The trade-off factor (3 (or S) can be different 
for each column. A combination of different regularizations on a column 
(for instance vt) can also be used to solve the multi-criterion problem 



2 II 



UtV 



+ 2I 



Bvt 



|2 



where the optimal update is given by 

^ [R^Ut - Plnxl + SBvt] 



\utU + S 



The one-norm regularizations as well as the two-norm regularization 
can be found in [1] and [3]. A major difference with that method is that 
the norm constraints is added to the rows rather than on the columns 
ofVorU as done here. However, for the two versions of the one-norm 
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regularization, the effects are someliow similar. Wfiile the two-norm 
regularization on the columns of U and V are simply scaling effects, 
which yield nothing in the RRI algorithm. We therefore only test the 
smoothness regularization at the end of the chapter with some numerical 
generated data. 

For more extensions and variants, see [16]. 

6. Experiments 

Here we present several experiments to compare the different descent 
algorithms presented in this paper. For all the algorithms, the scaling 
scheme proposed in section 4.0 was applied. 

Random matrices 

We generated 100 random nonnegative matrices of different sizes. We 
used seven different algorithms to approximate each matrix: 

■ the multiplicative rule (Mult), 

■ alternative least squares using Matlab function Isqnonneg (ALS), 

■ a full space search using line search and Armijo criterion (FLine), 

■ a coordinate search alternating on U and V, and using line search 
and Armijo criterion (CLine), 

■ a full space search using first-order approximation (FFO), 

■ a coordinate search alternating on U and V, and using first-order 
approximation (CFO) 

■ an iterative rank-one residue approximation (RRI). 

For each matrix, the same starting point is used for every algorithm. 

We create a starting point by randomly generating two matrices U and 
V and then rcscaling them to yield a first approximation of the original 
matrix A as proposed in Section 4.0: 

U = UD^, V = VD~^ 

where 

From (1.15), we see that when approaching a KKT stationary point 
of the problem, the above scaling factor a — > 1. This implies that every 
KKT stationary point of this problem is scale-invariant. 



\ii = j 
otherwise 
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e 




Mult 


ALS 


FLine 


CLine 


FFO 


CFO 


RRI 










(m=30, 


n=20, r=2) 








10" 


2 


0.02(96) 


0.40 


0.04 


0.02 


0.02 


0.01 


0.01 


10" 


3 


0.08(74) 


1.36 


0.12 


0.09 


0.06 


0.04 


0.03 


10" 


4 


0.17(71) 


2.81 


0.24 


0.17 


0.11 


0.08 


0.05 


10" 


5 


0.36(64) 


4.10 


0.31 


0.25 


0.15 


0.11 


0.07 


10" 


6 


0.31(76) 


4.74 


0.40 


0.29 


0.19 


0.16 


0.09 










(m=100 


n=50, r=5) 








10" 


2 


45*(0) 


3.48 


0.10 


0.09 


0.09 


0.04 


0.02 


10" 


3 


45*(0) 


24.30(96) 


0.59 


0.63 


0.78 


0.25 


0.15 


10" 


4 


45* (0) 


45*(0) 


2.74 


2.18 


3.34 


0.86 


0.45 


10" 


5 


46* (0) 


45*(0) 


6.93 


4.06 


6.71 


1.58 


0.89 


10" 


6 


46* (0) 


4B*(0) 


7.23 


4.75 


8.98 


1.93 


1.30 










(m=100, 


n=Ba, r=10) 








10" 


2 


46* (0) 


11.61 


0.28 


0.27 


0.18 


0.11 


0.06 


10" 


3 


46* (0) 


41.89(6) 


1.90 


2.11 


1.60 


0.74 


0.36 


10" 


4 


46*(0) 


4B*(0) 


7.20 


6.57 


5.08 


2.29 


1.13 


10" 


5 


45*(0) 


4B*(0) 


12.90 


9.69 


10.30 


4.01 


1.71 


10" 


6 


45*(0) 


4B*(0) 


14.62(99) 


11.68(99) 


13.19 


5.26 


2.11 










(m=100, 


n=50, r=15) 








10" 


2 


46*(0) 


25.98 


0.66 


0.59 


0.40 


0.20 


0.09 


10" 


3 


46*(0) 


45*(0) 


3.90 


4.58 


3.18 


1.57 


0.61 


10" 


4 


45* (0) 


45*(0) 


16.55(98) 


13.61(99) 


9.74 


6.12 


1.87 


10" 


5 


45*(0) 


45*(0) 


21.72(97) 


17.31(92) 


16.59(98) 


7.08 


2.39 


10" 


6 


46*(0) 


4B*(0) 


25.88(89) 


19.76(98) 


19.20(98) 


10.34 


3.66 










(m=100, 


n=100, r=20) 








10" 


2 


46*(0) 


42.51(4) 


1.16 


0.80 


0.89 


0.55 


0.17 


10" 


3 


45* (0) 


45*(0) 


9.19 


8.58 


10.51 


5.45 


1.41 


10" 


4 


45*(0) 


45«(0) 


28.59(86) 


20.63(94) 


29.89(69) 


12.59 


4.02 


10" 


5 


46* (0) 


4B«(0) 


32.89(42) 


27.94(68) 


34.59(34) 


18.83(90) 


6.69 


10" 


6 


46*(0) 


4B*(0) 


37,14(20) 


30.75(60) 


36.48(8) 


22.80(87) 


8.71 










(m=200, 


n=100, r=30) 








10" 


2 


46* (0) 


4B*(0) 


2.56 


2.20 


2.68 


1.31 


0.44 


10" 


3 


46* (0) 


4B«(0) 


22.60(99) 


25.03(98) 


29.67(90) 


12.94 


4.12 


10" 


4 


46*(0) 


4B*(0) 


36.49(2) 


39.13(13) 


46*(0) 


33.33(46) 


14.03 


10" 


5 


45*(0) 


4B*(0) 


45.(0) 


39.84(2) 


45*(0) 


37.60(6) 


21.96(92) 


10" 


6 


45*(0) 


45*(0) 


45* (0) 


45*(0) 


45* (0) 


45»{0) 


25.61(87) 



Table 1.1. Comparison of average successful running time of algorithms over 100 
random matrices. Time limit is 45 seconds. 0.02(96) means that a result is returned 

with the required precision e within 45 seconds for 96 (of 100) matrices of which the 
average running time is 0.02 seconds. 45 * (0): failed in all 100 matrices. 
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The algorithms are aU stopped when the projected gradient norm is 
lower than e times the gradient norm at the starting point or when it 
takes more than 45 seconds. The relative precisions e are chosen equal 
to 10-2, 10"^, 10"^, 10-^ 10"^. No limit was imposed on the number 
of iterations. 

For alternative gradient algorithms CLine and CFO, we use different 
precisions eu and ey for each of the inner iteration for U and for V as 
suggested in [23] where ejj and ey are initialized by 10~^. And when the 
inner loop for UovV needs no iteration to reach the precision eu or ey, 
one more digit of precision will be added into eu or ey (i.e. eu = e[//10 
or ey = ey/10). 

Table 1.1 shows that for all sizes and ranks, Algorithm RRI is the 
fastest to reach the required precision. Even though it is widely used in 
practice, algorithm Mult fails to provide solutions to the NMF problem 
within the allocated time. A further investigation shows that the algo- 
rithm gets easily trapped in boundary points where some Uij and/or Vij 
is zero while V^/.^ and/or Vy.^. is negative, hence violating one of the 
KKT conditions (1.11). The multiplicative rules then fail to move and 
do not return to a local minimizer. A slightly modified version of this 
algorithm was given in [21], but it needs to wait to get sufficiently close 
to such points before attempting an escape, and is therefore also not 
efficient. The ALS algorithm can return a stationary point, but it takes 
too long. 

We select five methods: FLine, CLine, FFO, CFO and RRI for a 
more detailed comparison. For each matrix A, we run these algorithms 
with 100 different starting points. Figure 1.1, 1.2, 1.3 and 1.4 show 
the results with some different settings. One can see that, when the 
approximated errors are almost the same between the algorithms, RRI 
is the best overall in terms of running times. It is probably because the 
RRI algorithm chooses only one vector ut or vt to optimize at once. This 
allows the algorithm to move optimally down on partial direction rather 
than just a small step on a more global direction. Furthermore, the 
computational load for an update is very small, only one matrix-vector 
multiplication is needed. All these factors make the running time of the 
RRI algorithm very attractive. 

Image data 

The following experiments use the Cambridge ORL face database as 
the input data. The database contains 400 images of 40 persons (10 
images per person). The size of each image is 112x92 with 256 gray levels 
per pixel representing a front view of the face of a person. The images 
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Figure 1.1. Comparison of selected algorithms for e = 10 
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Figure 1.2. Comparison of selected algorithms for e = 10 
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m = 100, n = 50, r = 15, E = 10 
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Figure 1.3. Comparison of selected algorithms for e = 10 
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Figure I.4. Comparison of selected algorithms for e = 10 
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are then transformed into 400 "face vectors" in M1°304 (112 x 92 = 10304) 
to form the data matrix A of size 10304 x 400. We used three weight 
matrices of the same size of A (ie. 10304 x 400). Since it was used in 
[20], this data has become the standard benchmark for NMF algorithms. 



Frobenius Error 




- MuN (94s) 

■ FLine (201s) 

- CLine (3274s) 

- FFO (226s) 

■ FCO (1096s) 
-RRI(118s) 



Figure 1.5. NMF: Error vs. Iterations 



In the first experiment, we run six NMF algorithms described above 
on this data for the reduced rank of 49. The original matrix A is con- 
stituted by transforming each image into one of its column. Figure 1.5 
shows for the six algorithms the evolution of the error versus the number 
of iterations. Because the minimization process is different in each algo- 
rithm, we will say that one iteration corresponds to all elements of both 
U and V being updated. Figure 1.6 shows the evolution of the error 
versus time. Since the work of one iteration varies from one algorithm 
to another, it is crucial to plot the error versus time to get a fair com- 
parison between the different algorithms. In the two figures, we can see 
that the RRI algorithm behaves very well on this dataset. And since its 
computation load of each iteration is small and constant (without inner 
loop), this algorithm converges faster than the others. 

In the second experiment, we construct a third-order nonnegative ten- 
sor approximation. We first build a tensor by stacking all 400 images to 
have a 112 x 92 x 400 nonnegative tensor. Using the proposed algorithm, 
a ranA: — 142 nonnegative tensor is calculated to approximate this tensor. 
Figure 1.7 shows the result for six images chosen randomly from the 400 
images. Their approximations given by the rank-142 nonnegative tensor 
are much better than that given by the rank-8 nonnegative matrix, even 
though they require similar storage space: 8 * (112 * 92 + 400) = 85632 
and 142 * (112 + 92 + 400) = 85768. The rank-8 truncated SVD approx- 
imation (i.e. [^8]+) is also included for reference. 
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Figure 1.6. NMF: Error vs. Time 






Figure 1.7. Tensor Factorization vs. Matrix Factorization on facial data. Six ran- 
domly chosen images from 400 of ORL dataset. From top to bottom: original images, 
their rank — 8 truncated SVD approximation, their rank — 142 nonnegative tensor 
approximation (150 RRI iterations) and their rank — 8 nonnegative matrix approxi- 
mation (150 RRI iterations). 
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In the third experiment, we apply the variants of RRI algorithm men- 
tioned in Section 5.0 to the face databases. The following settings are 
compared: 

■ Original: original faces from the databases. 

■ 49NMF: standard factorization (nonnegative vectors), r = 49. 

■ lOOBinary: columns of U are limited to the scaled binary vectors, 
r = 100. 

■ 49SparselO: columns of U are sparse. Not more than 10% of the 
elements of each column of A are positive, r = 49. 

■ 49Sparse20: columns of U are sparse. Not more than 20% of the 
elements of each column of A are positive, r = 49. 

■ 49HSparse60: columns of U are sparse. The Hoyer sparsity of 
each column of U are 0.6. r = 49. 

■ 49HSparse70: columns of U are sparse. The Hoyer sparsity of 
each column of U are 0.7. r = 49. 

■ 49HBSparse60: columns of U are sparse. The Hoyer sparsity of 
each column of U are 0.6. Columns of V are scaled binary, r = 49. 

■ 49HBSparse70: columns of U are sparse. The Hoyer sparsity of 
each column of U are 0.7. Columns of V are scaled binary, r = 49. 

For each setting, we use RRI algorithm to compute the corresponding 
factorization. Some randomly selected faces arc reconstructed by these 
settings as shown in Figure 1.8. For each setting, RRI algorithm pro- 
duces a different set of bases to approximate the original faces. When the 
columns of V are constrained to scaled binary vectors (lOOBinary), the 
factorization can be rewritten as UV^ = IJB^ , where S is a binary ma- 
trix. This implies that each image is reconstructed by just the presence 
or absence of 100 bases shown in Figure 1.9. 

Figure 1.10 and 1.11 show nonnegative bases obtained by imposing 
some sparsity on the columns of V . The sparsity can be easily controlled 
by the percentages of positive elements or by the Hoyer sparsity measure. 

Figure 1.12 combines the sparsity of the bases (columns of U) and 
the binary representation of V. The sparsity is measured by the Hoyer 
measure as in Figure 1.11. Only with the absence or presence of these 
49 features, faces are approximated as showed in the last two rows of 
Figure 1.8. 
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Figure 1.8. Nonnegative matrix factorization with several sparse settings 
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Figure 1.9. Bases from lOOBinary setting 
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(a) (b) 




Figure 1.10. Sparse bases 49Sparse20 and 49SparselO. Maximal percentage of 
positive elements is 20% (a) and 10% (b) 



(a) (b) 




Figure 1.11. Hoyer sparse bases 49HSparse60 and 49HSparse70. Sparsity of 
bases is 0.6 (a) and 0.7 (b) 
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Figure 1.12. Hoyer sparse bases 49HBSparse60 and 49HBSparse70. Sparsity of 
bases is 0.6 (a) and 0.7 (b). V is binary matrix. 



The above examples show how to use the variants of the RRI algo- 
rithm to control the sparsity of the bases. One can see that the sparser 
the bases are, the less storage is needed to store the approximation. 
Moreover, this provides a part-based decomposition using local features 
of the faces. 

Smooth approximation 

We carry out this experiment to test the new smoothness constraint 
introduced in the previous section: 

]^\\Ri - UiV^\\]? + ^\\v - Bvi\\]p, 5>0 

where B is defined in (1.29). 

We generate the data using four smooth nonnegative functions /i, /2, 
/s et /4, described in Figure 1.13, where each function is represented as 
a nonnegative vector of size 200. 

We then generate a matrix A containing 100 mixture of these functions 
as follows 



A = max{FE'^ + N, 0) 



Descent methods for Nonnegative Matrix Factorization 



43 



0.18 




50 100 150 200 50 100 150 200 



Figure 1.14- Randomly selected generated data 

where F = [fi f2 /s /4], E is a random nonnegative matrix and 
is normally distributed random noise with \\N\\f = 0.2\\FE'^\\f. Four 
randomly selected columns of A are plotted in Figure 1.14. 




We run the regularized RRI algorithm to force the smoothness of 
columns of U. We apply, for each run, the same value of 5 for all 
the columns of C/: 5 = 0, 10, 100. The results obtained through these 
runs are presented in Figure 1.15. We see that, without regulariza- 
tion, i.e. 5 = 0, the noise is present in the approximation, which pro- 
duces nonsmooth solutions. When increasing the regularizing terms, i.e. 
5 = 10, 100, the reconstructed functions become smoother and the shape 
of the original functions are well preserved. 

This smoothing technique can be used for applications like that in 
[27], where smooth spectral reflectance data from space objects is un- 
mixed. The multiplicative rules are modified by adding the two-norm 
regularizations on the factor U and V to enforce the smoothness. This 
is a different approach, therefore, a comparison should be carried out. 

We have described a new method for nonnegative matrix factorization 
that has a good and fast convergence. Moreover, it is also very flexible 
to create variants and to add some constraints as well. The numerical 
experiments show that this method and its derived variants behave very 
well with different types of data. This gives enough motivations to ex- 
tend to other types of data and applications in the future. In the last 
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two chapters of this thesis, it is applied to weighted cost functions and 
to symmetric factorizations. 

7. Conclusion 

This paper focuses on the descent methods for Nonnegative Matrix 
Factorization, which are characterized by nonincreasing updates at each 
iteration. 

We present also the Rank-one Residue Iteration algorithm for com- 
puting an approximate Nonnegative Matrix Factorization. It uses re- 
cursively nonnegative rank one approximations of a residual matrix that 
is not necessarily nonnegative. This algorithm requires no parameter 
tuning, has nice properties and typically converges quite fast. It also 
has many potential extensions. During the revision of this report, we 
were informed that essentially the same algorithm was published in an 
independent contribution [8] and also mentioned later in an independent 
personal communication [12]. 
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